[MachineLearning] Hyperparameters: Learning Rate


Gradient Descent

I won't go into the concepts and formulas of the gradient descent algorithm in detail here. Instead, here is a slide from Andrew Ng's course (the update rule it shows is reproduced below the notes):

A few notes on the slide:

  • := is the assignment operator

  • $J(\theta_{0},\theta_{1})$ is the cost function

  • $\alpha$ is the learning rate; it controls how large a step we take when updating the parameter $\theta_{j}$.

    When the partial-derivative term is 0, we have already reached a (local) minimum and gradient descent stops moving. This also means that even with a fixed $\alpha$, gradient descent can still converge to a local minimum.

  • Gradient descent updates $\theta_{0}$ and $\theta_{1}$ simultaneously.
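
For reference, the update rule shown on that slide is the standard gradient descent update:

$$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta_{0},\theta_{1}) \qquad (j = 0, 1\text{, both updated simultaneously})$$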

So in general, a gradient descent algorithm requires setting a learning rate.

SGD and minibatch-SGD

Stochastic Gradient Descent (SGD) uses a single randomly chosen sample for each update.

Minibatch-SGD computes the gradient over a batch of batch-size samples at a time.
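
As a rough sketch of the difference between the two (plain NumPy; the data X, Y, the weights w, and the grad(w, X, Y) gradient function are hypothetical placeholders, not from the original post):

import numpy as np

# Hypothetical pieces: data arrays X, Y, a weight vector w, and a function
# grad(w, X, Y) that returns the gradient averaged over the given samples.

def sgd_step(w, X, Y, lr, grad):
    # Plain SGD: one randomly chosen sample per update
    i = np.random.randint(len(X))
    return w - lr * grad(w, X[i:i+1], Y[i:i+1])

def minibatch_sgd_step(w, X, Y, lr, grad, batch_size=32):
    # Minibatch SGD: average the gradient over batch_size random samples
    idx = np.random.choice(len(X), batch_size, replace=False)
    return w - lr * grad(w, X[idx], Y[idx])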

Learning rate

The learning rate determines how far the weights move along the gradient direction on each mini-batch.

For example, in the following slide from Andrew Ng's course ($J\left(\theta_{1}\right)$ in the figure is the cost function):

  • When the LR is small, training is reliable: the gradient steps move steadily toward the (local) minimum and the computed loss keeps decreasing. The cost is that descent is slow and training takes a long time.
  • When the LR is large, training overshoots the (local) minimum: the loss keeps oscillating up and down. In the worst case it may never reach the minimum at all, or even jump out of the current basin into another descent region.

So choosing a suitable LR takes repeated trial and adjustment.

Andrew Ng suggests some practical LR values to try, such as 0.001, 0.003, 0.01, 0.03, 0.1, and so on.

To summarize:

  1. A small LR is more precise
  2. A large LR makes the loss drop faster

Combining these two advantages leads to the following strategy:

At the start of training, the randomly initialized weights are far from the optimum, so use a larger LR to drive the loss down quickly toward a local minimum. Then, as training proceeds, reduce the LR to allow finer-grained weight updates that settle into the local minimum.

The goal of adjusting the LR is to make the loss converge quickly.

Andrew:

In gradient descent, as we approach a local minimum, gradient descent automatically takes smaller steps.

This is because at a local minimum the derivative is obviously zero, so as we approach the local minimum the derivative automatically becomes smaller and smaller. Gradient descent therefore automatically takes smaller steps; that is just how gradient descent works.

So there is actually no need to decrease α separately. That is the gradient descent algorithm; you can use it to minimize any cost function, not just the cost function J in linear regression.

Initialize Learning Rate

The simple approach is to try different values and see which one minimizes the loss without sacrificing training speed.

One way to choose a learning rate: start training the network with a low LR and increase the LR exponentially on every batch, recording the LR and loss for each batch. Then plot loss against LR and pick the LR that gives the lowest loss from the plot.
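
A minimal sketch of this procedure in plain Python; train_on_batch(lr, batch) is a hypothetical helper that runs one training step at the given LR and returns the loss:

def lr_range_test(batches, train_on_batch, lr_start=1e-6, lr_end=1.0):
    # Grow the LR exponentially from lr_start to lr_end over the list of batches
    factor = (lr_end / lr_start) ** (1.0 / len(batches))
    lr, lrs, losses = lr_start, [], []
    for batch in batches:
        losses.append(train_on_batch(lr, batch))
        lrs.append(lr)
        lr *= factor
    # Plot losses against lrs and pick an LR near the lowest loss,
    # just before the loss starts to blow up
    return lrs, losses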

Or:

Start from 0.1 and decrease the LR exponentially, trying 0.01, 0.001, and so on. If the loss starts to decrease within the first few iterations, that learning rate is the largest usable value; anything above it will not let the loss converge. Train with that maximum value for a while, then lower the LR according to the loss so the weights are updated at a finer granularity.

Decay Learning Rate

As mentioned above, the LR needs to be adjusted at different stages of training. For example, when the loss stops decreasing, shrinking the LR by a factor of 10 often helps.

But adjusting it by hand is hardly a programmer's way of doing things, so various methods for adjusting the LR automatically have appeared. Below are the LR decay strategies in TensorFlow.

The links at the bottom describe more strategies; here are only the five commonly used ones from the TensorFlow documentation.

exponential_decay

Exponential LR decay is the most commonly used decay method.

tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
  • learning_rate: the initial LR value
  • global_step: the current training step, used to compute the decay
  • decay_steps: the decay period; a decay is applied every decay_steps steps
  • decay_rate: the per-period decay factor that the LR is multiplied by
  • staircase: whether to decay in a staircase (stepwise) fashion

The decayed LR is computed as:

decayed_learning_rate = learning_rate *
                        decay_rate ^ (global_step / decay_steps)

If the staircase parameter is True, global_step / decay_steps is an integer division, and the decayed LR follows a staircase function.
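
A quick NumPy reproduction of the arithmetic above (not TensorFlow itself) shows what staircase changes:

import numpy as np

lr0, decay_rate, decay_steps = 0.1, 0.96, 1000
steps = np.arange(0, 5001, 500)

smooth   = lr0 * decay_rate ** (steps / decay_steps)   # staircase=False: decays a little every step
stepwise = lr0 * decay_rate ** (steps // decay_steps)  # staircase=True: drops only once per decay_steps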

Example code:

...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)

Example plot:

inverse_time_decay

tf.train.inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

Inverse time decay. Parameters are the same as above.

Computed as:

decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)

Example code, decaying 1/t with a decay rate of 0.5:

...
global_step = tf.Variable(0, trainable=False)
learning_rate = 0.1
decay_steps = 1.0
decay_rate = 0.5
learning_rate = tf.train.inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)

Example plot:

natural_exp_decay

tf.train.natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

Natural exponential decay. It is similar to exponential decay, except that the decay base is $1/e$.

The formula:

decayed_learning_rate = learning_rate * exp(-decay_rate * global_step)
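
Writing the exponential with an explicit base shows why the base is said to be $1/e$:

$$\text{lr} \cdot e^{-\text{decay\_rate} \cdot \text{global\_step}} = \text{lr} \cdot \left(\tfrac{1}{e}\right)^{\text{decay\_rate} \cdot \text{global\_step}}$$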

Example code, decaying with a decay rate of 0.5:

...
global_step = tf.Variable(0, trainable=False)
learning_rate = 0.1
decay_steps = 1.0
decay_rate = 0.5
learning_rate = tf.train.natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)

Example plot (the green curve is exponential_decay):

piecewise_constant

tf.train.piecewise_constant(x, boundaries, values, name=None)

Piecewise constant decay is similar to the staircase mode of exponential_decay, except that the value at each stage is set by hand.

Here x is the global step; boundaries=[step_1, step_2, …, step_n] specifies the steps at which the LR is decayed, and values=[val_0, val_1, val_2, …, val_n] gives the initial LR and the values used after each decay. Note that values must be one element longer than boundaries.

Example code: use a learning rate that's 1.0 for the first 100000 steps, 0.5 for steps 100001 to 110000, and 0.1 for any additional steps.

global_step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

# Later, whenever we perform an optimization step, we increment global_step.

Example plot:

polynomial_decay

tf.train.polynomial_decay(learning_rate, global_step, decay_steps, end_learning_rate=0.0001, power=1.0, cycle=False, name=None)

polynomial_decay decays the learning rate according to a polynomial.

The function returns the decayed learning rate. It is computed as:

global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate

If cycle is True then a multiple of decay_steps is used, the first one
that is bigger than global_step.

decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate

Example code: decay from 0.1 to 0.01 in 10000 steps using sqrt (i.e. power=0.5):

...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
end_learning_rate = 0.01
decay_steps = 10000
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step,
                                          decay_steps, end_learning_rate,
                                          power=0.5)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)

Example plot:

With cycle=False: the red curve is power=1 (linear decay), the blue curve is power=0.5 (square-root decay), and the green curve is power=2 (quadratic decay).
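
Since the plot is not reproduced here, a short NumPy sketch of the same comparison (cycle=False) gives the idea:

import numpy as np

lr0, lr_end, decay_steps = 0.1, 0.01, 10000
steps = np.minimum(np.arange(0, 12001, 1000), decay_steps)

def poly_decay(power):
    return (lr0 - lr_end) * (1 - steps / decay_steps) ** power + lr_end

linear, sqrt_curve, quadratic = poly_decay(1.0), poly_decay(0.5), poly_decay(2.0)
# power=1 falls linearly; power=0.5 stays higher early, then drops;
# power=2 drops quickly at first and then flattens toward lr_end.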

Experience

When training on my own dataset (3 classes) with the new YOLO, I did run into the loss not decreasing. At fewer than 2000 steps the loss was already stuck between 1.2 and 1.5, and over the next several thousand steps there was no downward trend at all.

So I lowered the LR, by 5x and then 10x at a time. When it went from the initial 5e-5 down to 1e-7 (while also doubling the batch size), the loss dropped noticeably, staying around 0.8-0.9 overall, with occasional values like 0.5 or 0.6; as training went on, the proportion of those smaller values increased.

Also, watching the loss takes patience: at some stage it may still be decreasing overall, just very slowly.

Here is a Q&A from darkflow:

What is the lowest loss value that can be reached?

Q:

hi, I have trained a yolo-small model to step 4648, but most of loss
values are greater than 1.0, and the result of test is not very well. I
want to know how well can loss value be, and could you please show some
key parameters when training, e.g learning rate, training time, the
final loss value, and so on.

A:

What batch size are you using? Because without the batch size, step number cannot say anything about how far you’ve gone. According to the author of YOLO, he used pretty powerful machine and the training have two stages with the first stage (training convolution layer with average pool) takes about a week. So you should be patient if you’re not that far from the beginning.

Training deep net is more of an art than science. So my suggestion is you first train your model on a small data size first to see if the model is able to overfit over training set, if not then there’s a problem to solve before proceeding. Notice due to data augmentation built in the code, you can’t really reach 0.0 for the loss.

I’ve trained a few configs on my code and the loss can shrink down well from > 10.0 to around 0.5 or below (parameters C, B, S are not relevant since the loss is averaged across the output tensor). I usually start with default learning rate 1e-5, and batch size 16 or even 8 to speed up the loss first until it stops decreasing and seem to be unstable.

Then, learning rate will be decreased down to 1e-6 and batch size increase to 32 and 64 whenever I feel that the loss get stuck (and testing still does not give good result). You can switch to other adaptive learning rate training algorithm (e.g. Adadelta, Adam, etc) if you feel like familiar with them by editing ./yolo/train.py/yolo_loss()

You can also look at the learning rate policy the YOLO author used, inside .cfg files.

Best of luck

Reference

如何估算深度神经网络的最优学习率
Tensorflow 中 learning rate decay 的奇技淫巧
Decaying the learning rate

